In [1]:
import pandas as pd
import numpy as np
from io import StringIO
import numpy.linalg as la
import matplotlib.pyplot as plt
from matplotlib import cm as cm
import seaborn as sns
sns.set(font_scale=2)
plt.style.use('seaborn-whitegrid')
%matplotlib inline

Machine Learning

For this activity, you will explore the basics of machine learning. Machine learning describes a class of methods for automatically building mathematical models from training data. We will work with a dataset of Pokemon.

In this activity, you will:

  • explore the data using visualization tools
  • split your data into training and test sets
  • create a model to predict whether a Pokemon is legendary based on its properties

Load the dataset

In [2]:
# data source: https://www.kaggle.com/abcsds/pokemon/downloads/pokemon.zip/2
df = pd.read_csv("Pokemon.csv")

In the dataset, each row represents a Pokemon. How many Pokemon are in our dataset? How many features are in this dataset?

You can inspect the first few rows of your data using df.head().
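As a minimal sketch of this kind of inspection, the example below uses a tiny hand-made table in place of Pokemon.csv. The name `toy` and all of its columns and values are illustrative assumptions, not the contents of the real file.

```python
import pandas as pd

# Hypothetical stand-in for Pokemon.csv: three rows with made-up values.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "HP": [45, 60, 106],
    "Attack": [49, 62, 110],
    "Legendary": [False, False, True],
})

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
print(n_rows, n_cols)
print(toy.head(2))           # first two rows only
```

The same two calls, `df.shape` and `df.head()`, apply to the real DataFrame `df`.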

Define an array y, such that it contains whether a given Pokemon is legendary or not. The $i$th entry of y denotes whether the $i$th Pokemon is legendary (True) or not (False). We will later use a classification algorithm to help predict if a Pokemon is legendary.

Not every classifier can work with string or boolean types. Instead of having the array y as booleans, we can replace True with 1 and False with 0.
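One possible way to do this conversion, shown on a made-up boolean array (the values are illustrative and the names `legendary` and `y_int` are hypothetical):

```python
import numpy as np

# Illustrative boolean labels, not taken from the dataset.
legendary = np.array([False, True, False, True])

# astype(int) maps True to 1 and False to 0.
y_int = legendary.astype(int)
print(y_int)
```

A side benefit of the 0/1 encoding is that `y_int.sum()` counts the positive (legendary) entries directly.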

What are the features in our data that can be used to determine the legendary status of a Pokemon?

Save these features in the variable labels. Hint: there are 7 features.

Create another dataframe (name it X) with the relevant features.

Then get the NumPy array x with the values of the DataFrame X.
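The pattern can be sketched on a small hypothetical table. The column names below are placeholders chosen for illustration; they are not meant to be the answer to the 7-feature question above.

```python
import pandas as pd

# Hypothetical frame with made-up values.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "HP": [45, 60, 106],
    "Attack": [49, 62, 110],
    "Legendary": [False, False, True],
})

feature_names = ["HP", "Attack"]   # list of column names to keep
X_toy = toy[feature_names]         # DataFrame restricted to those columns
x_toy = X_toy.to_numpy()           # plain NumPy array of the values

print(x_toy.shape)
```

`X_toy.values` returns the same array; `to_numpy()` is the spelling recommended in recent pandas documentation.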

Splitting the dataset

To assess the model’s performance later, we divide the dataset into two parts: a training set and a test set. The first is used to train the system, while the second is used to evaluate the learned or trained model.

We are going to use sklearn.model_selection.train_test_split to split the dataset.

In [11]:
from sklearn.model_selection import train_test_split

A common splitting choice is to take 2/3 of your original dataset as the training set, while the remaining 1/3 composes the test set. You should select this proportion by assigning the variable s and setting the argument test_size = s in sklearn.model_selection.train_test_split.

In [12]:
s = 0.33

We will fix the seed for the random number generator in order to get reproducible results.

In [13]:
seed = 41

Split the arrays x and y into training data (X_train, Y_train) and test data (X_test, Y_test).

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
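A minimal sketch of such a split on synthetic stand-in arrays (the names `x_demo` and `y_demo` and their contents are made up for illustration). It uses the same `test_size` and seed values as the cells above, passing the seed through the `random_state` argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 12 samples, 2 features, made-up labels.
x_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0, 1] * 6)

# The function returns the pieces in this order:
# train features, test features, train labels, test labels.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    x_demo, y_demo, test_size=0.33, random_state=41
)

print(X_tr.shape, X_te.shape)
```

Calling the function again with the same `random_state` reproduces the same split, which is what makes later scores comparable between runs.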

Logistic regression

Now that we have a dataset to train our model and a dataset to validate our model, we need to construct a model.

To introduce this, we will begin by using a logistic regression model. This is used for classification tasks where each data point belongs to exactly one class. The model can be fit using either a modified version of least squares or Newton's method.
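For intuition, a binary logistic regression model passes a weighted sum of the features through the logistic (sigmoid) function to obtain a probability, then thresholds it. The sketch below uses hypothetical weights `w`, intercept `b`, and feature row `x_row`; in practice these weights are learned from the training data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one hypothetical sample with two features.
w = np.array([0.5, -0.25])
b = 0.1
x_row = np.array([2.0, 4.0])

p = sigmoid(w @ x_row + b)   # modeled probability of class 1
label = int(p >= 0.5)        # predict class 1 when the probability is at least 0.5
print(p, label)
```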

In [15]:
from sklearn.linear_model import LogisticRegression

Using the LogisticRegression class, make an instance of the model. Use the default parameters for now.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [16]:
model = LogisticRegression(solver="lbfgs")

Using this instance of the model, let's use the training data to train the model. Use model.fit(X_train, Y_train) to train the model.
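A sketch of the fitting step on a small synthetic dataset (the arrays `X_demo` and `Y_demo` are made up, with one feature whose large values correspond to class 1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, clearly separable data: one feature, label 1 for large values.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
Y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

demo_model = LogisticRegression(solver="lbfgs")
demo_model.fit(X_demo, Y_demo)   # learns one weight per feature plus an intercept

print(demo_model.classes_)       # class labels seen during training
print(demo_model.coef_.shape, demo_model.intercept_.shape)
```

After `fit`, the learned parameters are stored on the model object in `coef_` and `intercept_`.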

Model Prediction

We now have a trained model and we can begin using it to make predictions. Recall that we want to use our model to predict whether a Pokemon is legendary or not.

Use the model to predict whether the Pokemon in the test dataset X_test are legendary. You can make predictions with the model's predict method:

model.predict(X_test)
In [19]:
# number of legendary Pokemon: predicted (first) vs. actual (second)
print(Ypredict.sum())
print(Y_test.sum())
14
19
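The same predict-and-count pattern can be sketched end to end on synthetic data. All arrays here are made up, and the names `clf`, `X_new`, `Y_new`, and `Y_hat` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data: label is 1 for large feature values.
X_fit = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
Y_fit = np.array([0, 0, 0, 0, 1, 1, 1, 1])
clf = LogisticRegression(solver="lbfgs").fit(X_fit, Y_fit)

# Made-up "test" points with known labels.
X_new = np.array([[0.5], [1.5], [8.5], [9.5]])
Y_new = np.array([0, 0, 1, 1])

Y_hat = clf.predict(X_new)         # one 0/1 prediction per row of X_new
print(Y_hat.sum(), Y_new.sum())    # predicted vs. actual number of positives
```

Comparing the two sums only compares totals. Equal totals would not by themselves show that the same samples were flagged, so a per-sample comparison is still needed.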

But how good is our prediction? How can we measure the quality of our model?

One way of determining the performance of our model is using a confusion matrix. A confusion matrix describes the performance of the classification model on a set of test data for which the true values are known. A confusion matrix stores the true positives, false positives, false negatives, and true negatives for our test data.

In [20]:
from sklearn.metrics import confusion_matrix

Let's use the confusion_matrix function in sklearn to construct a confusion matrix for our dataset.

In [21]:
cmat = confusion_matrix(Y_test,Ypredict)

print("confusion matrix:\n",cmat)

TN, FP, FN, TP = cmat.ravel()
confusion matrix:
 [[239   6]
 [ 11   8]]

$$ \text{Confusion matrix} = \left[ \begin{array}{cc} TN & FP \\ FN & TP \end{array} \right] $$

TN: Predicted no (not legendary), and the Pokemon is not legendary. (How many non-legendary Pokemon are correctly identified?)

FP: Predicted yes (legendary), but the Pokemon is not legendary. (How many non-legendary Pokemon are identified as legendary?)

FN: Predicted no (not legendary), but the Pokemon is actually legendary. (How many legendary Pokemon are missed?)

TP: Predicted yes (legendary), and the Pokemon is legendary. (How many legendary Pokemon are correctly identified?)
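To check these definitions against what sklearn returns, the sketch below builds a confusion matrix for six made-up samples and recomputes each entry directly with NumPy. The arrays and the `*_manual` names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for six samples.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1])

# Rows correspond to the true class, columns to the predicted class.
cm_demo = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm_demo.ravel()

# The same four counts computed straight from the definitions.
tn_manual = np.sum((y_true == 0) & (y_pred == 0))
fp_manual = np.sum((y_true == 0) & (y_pred == 1))
fn_manual = np.sum((y_true == 1) & (y_pred == 0))
tp_manual = np.sum((y_true == 1) & (y_pred == 1))

print(cm_demo)
```

Because `ravel()` flattens the matrix row by row, the unpacking order TN, FP, FN, TP matches the layout shown in the matrix above.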

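There are different "scores" to quantify how good the model is. Here are some of them:

1) Accuracy: overall, how often is the prediction correct? https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html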

In [22]:
from sklearn.metrics import accuracy_score

accuracy_score(Y_test, Ypredict)
Out[22]:
0.9356060606060606
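The accuracy can also be recomputed by hand from the four confusion-matrix entries printed earlier in this notebook. The sketch below additionally computes a baseline that is not part of the original activity: the accuracy of always predicting "not legendary".

```python
# Counts taken from the confusion matrix printed earlier in this notebook.
TN, FP, FN, TP = 239, 6, 11, 8
total = TN + FP + FN + TP

accuracy = (TP + TN) / total   # correct predictions / all predictions

# Baseline for comparison: always predicting "not legendary" is correct
# for every truly non-legendary Pokemon, i.e. TN + FP of them.
baseline = (TN + FP) / total

print(accuracy, baseline)
```

Since most Pokemon in the test set are not legendary, the baseline is already high, which is one reason to look at scores other than accuracy when the classes are imbalanced.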

2) Precision: when the model predicts yes (legendary), how often is the prediction correct? https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

In [24]:
TP/(TP+FP)
Out[24]:
0.5714285714285714
In [25]:
from sklearn.metrics import precision_score

precision_score(Y_test, Ypredict)
Out[25]:
0.5714285714285714

3) Recall: when the Pokemon is actually legendary, how often does the model predict yes? https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

In [26]:
TP/(TP+FN)
Out[26]:
0.42105263157894735
In [27]:
from sklearn.metrics import recall_score

recall_score(Y_test, Ypredict)
Out[27]:
0.42105263157894735

Different Models

Starting with an initial dataset, we learned how to prepare the data, split it, construct a model, and then use that model to make predictions with sklearn.

Let's now repeat this experiment with different models. Below are 5 different classifiers (models) found in sklearn. Compare your results for each of the classifiers. Which works best for the task of determining the legendary status of a Pokemon?

In [28]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
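One way to organize the comparison is a loop over a dictionary of models, sketched here on synthetic data generated with `make_classification` rather than on the Pokemon dataset. The names `classifiers` and `results` are hypothetical, all models use default hyperparameters apart from a fixed `random_state`, and the scores this produces say nothing about which model works best on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in data (not the Pokemon dataset): 300 samples, 7 features.
X_syn, y_syn = make_classification(n_samples=300, n_features=7, random_state=41)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=41
)

classifiers = {
    "KNeighbors": KNeighborsClassifier(),
    "GradientBoosting": GradientBoostingClassifier(random_state=41),
    "RandomForest": RandomForestClassifier(random_state=41),
    "GaussianNB": GaussianNB(),
    "SVC": SVC(),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)            # train on the training split
    y_hat = clf.predict(X_te)      # predict on the held-out test split
    results[name] = (
        accuracy_score(y_te, y_hat),
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat, zero_division=0),
    )
    print(name, results[name])
```

Using the same split for every classifier keeps the comparison fair, since each model is evaluated on exactly the same test samples.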